Powerhouse of neural networks

Perceptron
A perceptron, the basic unit of a neural network, comprises essential components that collaborate in information processing:

z = \sum_{i=1}^n w_i \cdot x_i + b = w \cdot \mathbf{x} + b

It is important to note that an input's weight reflects how strongly that input influences the node. Similarly, the bias gives the model the ability to shift the activation function curve up or down.

During training, the perceptron learns by adjusting its weights and bias based on a learning algorithm. A common approach is the perceptron learning algorithm, which updates weights based on the difference between the predicted output and the true output.

These components work together to enable a perceptron to learn and make predictions. While a single perceptron can perform binary classification, more complex tasks require the use of multiple perceptrons organized into layers, forming a neural network.
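As a concrete illustration, the weighted-sum-and-step computation above can be sketched in a few lines of NumPy. The function name and the input, weight, and bias values below are made up for the example:

```python
import numpy as np

def perceptron_forward(x, w, b):
    """Weighted sum z = w·x + b followed by a step activation."""
    z = np.dot(w, x) + b
    return 1 if z > 0 else 0

# illustrative values
x = np.array([2.0, 3.0])
w = np.array([0.5, -0.2])
b = 0.1
print(perceptron_forward(x, w, b))  # z = 1.0 - 0.6 + 0.1 = 0.5 > 0, so the output is 1
```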

[Figure: Perceptron]

1. How the Perceptron Trick Works

  1. Initialization:
    • The perceptron initializes the weights w and bias b, often randomly.
  2. Prediction (Forward Pass):
    • The perceptron takes an input feature vector \mathbf{x} and calculates the weighted sum

      \implies w \cdot \mathbf{x} = \sum_{i=1}^{n} w_i x_i = w_1 x_1 + w_2 x_2 + \dots + w_n x_n

    • It then adds the bias term b to this weighted sum, giving the model the flexibility to shift its decision boundary:

      \implies z = w \cdot \mathbf{x} + b

      and applies an activation function to produce the output, which may be binary (as with the step function below) or continuous, depending on the activation chosen:

      \implies \phi(z) = \begin{cases} 1, & \text{if } z > 0 \\ 0, & \text{if } z \leq 0 \end{cases}

      \implies Y = \phi{(z)} = \phi{\big( w \cdot \mathbf{x} + b \big)}

  3. Calculate the Loss:
    • The perceptron compares its predicted output to the actual label (ground truth) and computes the loss. For a perceptron, the loss function is simple: it checks whether the prediction was correct or not.
  4. Update the Weights (Learning):
    • If the prediction is correct, no update is needed. If the prediction is incorrect, the perceptron updates the weights and bias based on the difference between the prediction and the true label.

1.1 Perceptron Learning Algorithm (Training Process):

The perceptron is trained by error-driven learning: the weights are updated based on the error made on each prediction. Updates are applied one misclassified example at a time, in the spirit of stochastic gradient descent (SGD).

Here’s the update rule:

\implies w = w + \eta \cdot (y - \hat{y}) \cdot \mathbf{x}
\implies b = b + \eta \cdot (y - \hat{y})

Where:

  • \eta is the learning rate, a small positive number controlling how much to change the weights.
  • y is the true label (either 1 or 0).
  • \hat{y} is the predicted output from the perceptron.
  • \mathbf{x} is the input vector.
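The update rule above can be sketched as a single training step. This is a minimal version, assuming the step activation from earlier; the helper name `perceptron_update` and the numbers are illustrative:

```python
import numpy as np

def perceptron_update(w, b, x, y, eta=0.1):
    """One perceptron learning step: predict, then correct on error."""
    y_hat = 1 if np.dot(w, x) + b > 0 else 0
    w = w + eta * (y - y_hat) * x
    b = b + eta * (y - y_hat)
    return w, b

w, b = np.zeros(2), 0.0
w, b = perceptron_update(w, b, np.array([2.0, 3.0]), y=1)
print(w, b)  # [0.2 0.3] 0.1
```

Note that when the prediction is already correct, (y - \hat{y}) = 0 and the weights are left unchanged.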

1.2 Example of a Perceptron Trick:

Let’s walk through an example to see how a perceptron works step by step.

Dataset:

We have two-dimensional data points (features $ x_1 $ and $ x_2 $) and their corresponding labels:

x_1   x_2   Label y
 2     3      1
 1     1      0
 2     1      1
 0     1      0
Step-by-Step Process:
  1. Initialize Weights and Bias:
    • Let’s assume w_1 = 0, w_2 = 0, and b = 0.
  2. Predict Output:
    • For the first data point \mathbf{x} = [2, 3], true label y = 1:
      • Compute the weighted sum:
        z = (w_1 \cdot x_1) + (w_2 \cdot x_2) + b = (0 \cdot 2) + (0 \cdot 3) + 0 = 0.
      • Apply the step function:
        f(z) = 0 because z \leq 0.
      • The perceptron predicts 0, but the true label is 1, so it’s misclassified.
  3. Update Weights:
    • Apply the weight update rule:
      w_1 = 0 + 0.1 \cdot (1 - 0) \cdot 2 = 0.2
      w_2 = 0 + 0.1 \cdot (1 - 0) \cdot 3 = 0.3
      b = 0 + 0.1 \cdot (1 - 0) = 0.1
  4. Continue for All Data Points:
    • Repeat this process for all data points in the dataset, updating weights and bias when there are misclassifications.
  5. Convergence:
    • The perceptron will eventually converge if the data is linearly separable.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=2, n_informative=1,n_redundant=0,
                           n_classes=2, n_clusters_per_class=1, random_state=40,hypercube=False,class_sep=10)
sns.scatterplot(x=X[:,0], y=X[:,1], hue=y,s=100)
plt.xlabel("x1")
plt.ylabel("x2")
plt.show()

def perceptron(x,y, epochs = 100):
    x = np.insert(x, 0, 1, axis = 1)
    weights = np.ones(x.shape[1])
    lr = 0.1

    for _ in range(epochs):
        j = np.random.randint(0, x.shape[0])
        z = np.dot(x[j], weights)
        y_pred = 0 if z <= 0 else 1
        error = y[j] - y_pred
        weights = weights + lr * error * x[j]

    return weights[0], weights[1:]
bias, weight = perceptron(X,y, epochs=100)
bias, weight
(1.2000000000000002, array([0.55762365, 0.87483179]))

This bias and weight describe the decision boundary in general form, i.e., w_0 + w_1x_1 + w_2x_2 = 0.
Converting it into slope-intercept form, i.e., x_2 = m x_1 + b, using
m = \Large\frac{-w_1}{w_2}   and   b = \Large\frac{-w_0}{w_2}

m = -weight[0]/weight[1]
b = -bias/weight[1]
x_input = np.linspace(-3, 3, 100)
y_input = m * x_input + b


sns.scatterplot(x=X[:,0], y=X[:,1], hue=y,s=100)
sns.lineplot(x=x_input, y=y_input, color = 'black')
plt.xlim(-3,2)
plt.ylim(-4,3)
plt.title("Epochs = 100")
plt.show()

bias, weight = perceptron(X,y, epochs=1000)
m = -weight[0]/weight[1]
b = -bias/weight[1]
x_input = np.linspace(-3, 2, 100)
y_input = m * x_input + b


sns.scatterplot(x=X[:,0], y=X[:,1], hue=y,s=100)
sns.lineplot(x=x_input, y=y_input, color = 'black')
plt.xlim(-3,2)
plt.title("Epochs = 1000")
plt.show()

from matplotlib.animation import FuncAnimation  
from IPython.display import HTML
fig = plt.figure()  
def perceptron_new(X,y, epochs = 100):
    x = np.insert(X, 0, 1, axis = 1)
    weights = np.ones(x.shape[1])
    lr = 0.1
    history = []
    for _ in range(epochs):
        j = np.random.randint(0, x.shape[0])
        z = np.dot(x[j], weights)
        y_pred = 0 if z <= 0 else 1
        error = y[j] - y_pred
        weights = weights + lr * error * x[j]
        history.append(weights)
    return history

history = pd.DataFrame(perceptron_new(X, y, epochs = 500))
all_b = -history.iloc[:,0]/history.iloc[:,2]
all_m = -history.iloc[:,1]/history.iloc[:,2]

fig, ax = plt.subplots(figsize=(9,5))
fig.set_tight_layout(True)

x_input = np.linspace(-3, 3, 100)
sns.scatterplot(x=X[:,0], y=X[:,1], hue=y,s=100, ax = ax)
line, = ax.plot(x_input, x_input*50 - 4, 'r-', linewidth=2)
ax.set_ylim(-5, 3)

def update(i):
    label = 'epoch {0}'.format(i + 1)
    line.set_ydata(x_input*all_m[i] + all_b[i])
    ax.set_title(label)
    # return line, ax

anim = FuncAnimation(fig, update, repeat=False, frames=500, interval=10)
plt.close()
HTML(anim.to_html5_video())

1.3 Problems with perceptron trick

  1. No method to decide which perceptron model is better.
  2. The perceptron will only converge if your training data is linearly separable.
  3. The solution found is not necessarily optimal, because points are picked at random for weight updates.
  4. Some training points may never be visited before the algorithm stops.
  5. Single Perceptron is limited to Binary Classification.

This is why we use loss functions in machine learning: they let us quantify which model is better.

2. Perceptron Loss Function

So,

  • The perceptron loss function measures how much the perceptron’s prediction differs from the true label. It’s specifically designed to penalize misclassified points.
  • The Perceptron typically uses stochastic gradient descent (SGD), meaning that the weights are updated after each misclassified training example, instead of waiting to update after computing the loss over the entire dataset.

The perceptron loss using SGD with model parameters is:

\large L(w, b) = \frac{1}{n} \sum_{i=1}^{n} \max(0, - y_i f(\mathbf{x_i}))

or
\large L(w, b) = \begin{cases} 0, & \text{if } y_i (w \cdot \mathbf{x_i} + b) > 0 \\ -y_i(w \cdot \mathbf{x_i} + b), & \text{if } y_i (w \cdot \mathbf{x_i} + b) \leq 0 \end{cases}
      —— eq. (1)

where,    f(\mathbf{x_i}) = w \cdot \mathbf{x_i} + b,
  n = no. of samples,
  \mathbf{x_i} = input vector,
  y_i = the true label (here y_i \in \{-1, +1\})

We minimize the loss by adjusting the weights and bias using the gradient of the loss function with respect to the weights and bias.
In simple words, we want the values of w and b for which L is minimized.

  • If the data point is correctly classified, the loss is zero.

  • If the data point is misclassified, the loss is greater than zero, and the weights are updated to correct the misclassification.

For example, if we have data like

 _    \mathbf{x}_1   \mathbf{x}_2   \mathbf{y}
 1.   x_{11}         x_{12}         y_1
 2.   x_{21}         x_{22}         y_2

then the loss is

\displaystyle L = \frac{1}{2} \big[\max(0, -y_1 f(x_1)) + \max(0, -y_2 f(x_2))\big]

where, f(x_1) = w_1x_{11} + w_2x_{12}+b
and    f(x_2) = w_1x_{21} + w_2x_{22}+b

2.1 Gradient Descent

The gradient descent for a neural network is used to adjust the weights in order to minimize the loss function.
The formula for updating the weights is:

\large w^{(t+1)} = w^{(t)} - \eta \cdot \frac{\partial L}{\partial w}       —— eq. (2)

Using eq. (1), the gradient for perceptron is:

\Large \frac{\partial L}{\partial w} = \begin{cases} 0, & \text{if } y_i (w \cdot \mathbf{x_i} + b) > 0 \\ -y_i\mathbf{x_i}, & \text{if } y_i(w \cdot \mathbf{x_i} + b) \leq 0 \end{cases}

If y_i is misclassified, the perceptron updates the weights as follows:

\large w^{(t+1)} = w^{(t)} + \eta \cdot y_i \cdot \mathbf{x_i}
      —— eq. (3)

2.2 Example of Perceptron Loss and Weight Update

Let’s consider a small example to illustrate the perceptron loss and weight update process.

Data:

  • Input vector:
    \mathbf{x}_1 = [1, 2]
  • True label:
    y_1 = 1
  • Initial weights:
    w = [0.5, -0.5]
  • Bias:
    b = 0
Step 1: Compute weighted sum (score):
w\cdot \mathbf{x}_1 + b = (0.5 \times 1) + (-0.5 \times 2) + 0 = 0.5 - 1 = -0.5

Step 2: Check classification:
Since -0.5 \leq 0, the point is misclassified.

Step 3: Compute the loss:
L(w, \mathbf{x}_1, y_1) = - y_1 \cdot (w \cdot \mathbf{x}_1 + b) = -1 \cdot (-0.5) = 0.5

Step 4: Update weights:
Assuming learning rate \eta = 0.1:

w := w + \eta \cdot y_1 \cdot \mathbf{x}_1

w := [0.5, -0.5] + 0.1 \times 1 \times [1, 2] = [0.5, -0.5] + [0.1, 0.2] = [0.6, -0.3]

Now the weights have been updated to better classify the data point.
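As a quick sanity check (not part of the original walk-through), this arithmetic can be reproduced in NumPy:

```python
import numpy as np

x1 = np.array([1.0, 2.0])
y1 = 1
w = np.array([0.5, -0.5])
b, eta = 0.0, 0.1

score = np.dot(w, x1) + b          # 0.5 - 1.0 = -0.5
loss = max(0.0, -y1 * score)       # misclassified, so the loss is 0.5
if y1 * score <= 0:                # update only on a mistake
    w = w + eta * y1 * x1
    b = b + eta * y1
print(score, loss, w, b)           # -0.5 0.5 [0.6 -0.3] 0.1
```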

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=2, n_informative=2, n_redundant=0,
                           n_clusters_per_class=1, random_state=41, n_classes=2)
y[y == 0]= -1
sns.set_style("darkgrid")
sns.scatterplot(x=X[:,0], y=X[:,1], hue=y,s=100, palette = {1:"red", -1:"green"})
plt.xlabel("x1")
plt.ylabel("x2")
plt.show()

def perceptron_loss(x,y, epochs = 1000):
    x = np.insert(x, 0, 1, axis = 1)
    weights = np.ones(x.shape[1])
    lr = 0.2

    for _ in range(epochs):
        for j in range(x.shape[0]):
            # j = np.random.randint(0, x.shape[0])
            z = np.dot(x[j], weights)
            if z*y[j] <= 0:
                weights = weights + lr * y[j] * x[j]

    return weights[0], weights[1:]
ep = 1000
bias, weight = perceptron_loss(X,y, epochs=ep)
m = -weight[0]/weight[1]
b = -bias/weight[1]
x_input = np.linspace(-3, 2, 100)
y_input = m * x_input + b


sns.scatterplot(x=X[:,0], y=X[:,1], hue=y,s=100, palette = {1:"red", -1:"green"})
sns.lineplot(x=x_input, y=y_input, color = 'black')
plt.xlim(-3,3)
plt.ylim(-4,4)
plt.title(f"Epochs = {ep}")
plt.show()

2.3 Advantages of the Perceptron:

  1. Simple and Easy to Understand:
    • The perceptron is one of the simplest machine learning algorithms, making it easy to understand and implement.
  2. Efficient for Linearly Separable Data:
    • The perceptron is efficient for solving problems where the data can be separated by a straight line (i.e., linearly separable problems).
  3. Incremental Learning:
    • The perceptron can learn from data incrementally, updating its weights with each new data point, which is useful for online learning.
  4. Foundation for More Complex Models:
    • The perceptron laid the groundwork for more advanced neural networks like multi-layer perceptrons (MLPs), where multiple layers of perceptrons can be stacked to handle non-linear data.

2.4 Disadvantages of the Perceptron:

  1. Only Works for Linearly Separable Data:
    • The perceptron cannot solve problems that are not linearly separable, such as the XOR problem.
  2. No Non-Linearity:
    • The perceptron uses a linear decision boundary, which makes it unsuitable for capturing complex, non-linear patterns in data.
  3. Limited to Binary Classification:
    • The basic perceptron can only solve binary classification problems (two classes).
  4. Sensitive to Feature Scaling:
    • The perceptron’s performance can be highly dependent on the scale of the input features. Features with very different scales can lead to inefficient learning unless they are properly normalized.
  5. No Probabilistic Interpretation:
    • The perceptron outputs binary decisions (0 or 1) and does not provide any probabilistic information or measure of confidence in its predictions.

2.5 Types of the Perceptron:

Perceptrons can be categorized into two types based on their complexity:

  1. Single Layer Perceptron :
    • This is the basic form of the perceptron, consisting of a single layer of output neurons connected directly to input features.
    • It can only solve linearly separable problems, meaning it cannot handle non-linear data.
  2. Multi-layer Perceptron (MLP) :
    • This is a more advanced version of the perceptron with one or more hidden layers between the input and output layers.
    • The inclusion of hidden layers allows MLPs to handle non-linearly separable problems.

Summary:

The perceptron algorithm is simple and easy to understand, but it has significant limitations, particularly with non-linearly separable data, data scaling, and multi-class problems. While it provides an important foundation in the history of neural networks, modern machine learning tasks often require more sophisticated algorithms like multi-layer perceptrons, logistic regression, or support vector machines.

3. Activation functions

  • The purpose of an activation function is to introduce non-linearity into the model, allowing the network to learn and represent complex patterns in the data. Without non-linearity, a neural network would essentially behave like a linear regression model, regardless of the number of layers it has.
  • An activation function decides whether (and how strongly) a neuron fires by transforming the neuron's weighted sum of inputs plus bias into its output.

3.1 Sigmoid Activation Function

\large\sigma (x) = \frac {1}{1 + e^{-x}}

  • Range: (0, 1)
  • Usage: Commonly used in binary classification problems (e.g., logistic regression).
  • Interpretation: It maps the input to a value between 0 and 1, making it useful in models that need probability outputs.
  • Advantage:
    1. The sigmoid function’s output can be interpreted as a probability, making it useful for binary classification tasks.
    2. The sigmoid function is differentiable, making it easy to optimize the network by adjusting the weights and biases of the neurons using gradient descent.
  • Drawback: It suffers from the “vanishing gradient problem” — when the input becomes very large or very small, the gradient becomes almost zero, slowing down learning.
import numpy as np
import matplotlib.pyplot as plt

# Sigmoid function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x_vals = np.linspace(-10, 10, 100)
y_vals = sigmoid(x_vals)

plt.plot(x_vals, y_vals, label='Sigmoid', color='blue')
plt.title('Sigmoid Activation Function')
plt.xlabel('x')
plt.ylabel('Sigmoid(x)')
plt.legend()
plt.axvline(0, color='black', linestyle='--')
plt.axhline(0.5, color='black', linestyle='--')
plt.grid(True)
plt.show()

3.2 ReLU (Rectified Linear Unit)

\large f(x) = max(0, x)

  • Range: [0, \infty)
  • Usage: Very popular in deep learning models, especially for hidden layers.
  • Interpretation: If the input is positive, ReLU returns it as is; if it’s negative, it outputs 0.
  • Advantage: It avoids the vanishing gradient problem that can occur with sigmoid activations and is computationally efficient.
  • Drawback:
    1. If too many pre-activation values fall below zero, most units in the network output zero, which can prevent learning. This is the dying ReLU problem.
    2. All negative values are mapped to zero, which can reduce the model’s ability to fit the data.
def relu(x):
    return np.maximum(0, x)

y_vals_relu = relu(x_vals)

plt.plot(x_vals, y_vals_relu, label='ReLU', color='green')
plt.title('ReLU Activation Function')
plt.xlabel('x')
plt.ylabel('ReLU(x)')
plt.legend()
plt.axvline(0, color='black', linestyle='--')
plt.grid(True)
plt.show()

3.3 Tanh (Hyperbolic Tangent)

\large tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}

  • Range: (-1, 1)
  • Usage:
    1. Often used in hidden layers of neural networks.
    2. It has been mostly used in recurrent neural networks for natural language processing and speech recognition tasks.
  • Interpretation: Similar to the sigmoid but with a range of (-1, 1). It centers the data better than sigmoid, helping learning algorithms converge faster.
  • Advantage: Avoids the issue of mean-shift in data and is preferred over sigmoid in practice.
  • Drawback:
    1. Still suffers from the vanishing gradient problem, though less severe than the sigmoid.
    2. Saturated tanh units pass almost no gradient, so they can behave like dead neurons during training.
def tanh(x):
    return np.tanh(x)

y_vals_tanh = tanh(x_vals)

plt.plot(x_vals, y_vals_tanh, label='Tanh', color='red')
plt.title('Tanh Activation Function')
plt.xlabel('x')
plt.axvline(0, color='black', linestyle='--')
plt.axhline(0, color='black', linestyle='--')
plt.ylabel('Tanh(x)')
plt.legend()
plt.grid(True)
plt.show()

3.4 Leaky ReLU

\large f(x) = max(\alpha x, x)

where \alpha is a small positive constant (e.g., 0.01).

  • Range: (-\infty, \infty)
  • Usage: A variant of ReLU to address the “dying ReLU” problem.
  • Interpretation:
    1. Leaky ReLU allows small, non-zero gradients when the input is negative.
    2. Leaky ReLU is less sensitive to initialization than the traditional ReLU, which means it can be used with a wider range of initialization strategies.
  • Advantage: Leaky ReLU can provide a non-zero output for negative input values, which can help avoid discarding potentially important information.
  • Disadvantage: Leaky ReLU can lead to gradient explosion if the learning rate is set too high and is not suitable for all types of problems.
def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

x_vals = np.linspace(-100, 100, 100)
y_vals_leaky_relu = leaky_relu(x_vals)

plt.plot(x_vals, y_vals_leaky_relu, label='Leaky ReLU', color='purple')
plt.title('Leaky ReLU Activation Function')
plt.xlabel('x')
plt.ylabel('Leaky ReLU(x)')
plt.axhline(0, color='black', linestyle='--')
plt.axvline(0, color='black', linestyle='--')
plt.grid(True)
plt.xlim(-100,100)
plt.legend()
plt.show()

3.5 Softmax

\large S(y_i) = \frac{e^{y_i}}{\sum \limits _{k}e^{y_k}}

  • Range: (0, 1) and the sum of outputs is 1.
  • Usage: Primarily used in the output layer of a classification model to predict probabilities for multiple classes.
  • Interpretation:
    1. It is used in multi-class classification problems where you need to assign a probability to each class.
    2. The Softmax function converts a vector of raw scores (logits) into probabilities that sum to 1.
  • Advantages: Probabilities are easier to understand and communicate compared to raw output values
  • Disadvantage: Because the softmax function amplifies differences, it can be sensitive to outliers or extreme values.
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import cufflinks as cf
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
pio.renderers.default = "iframe" 
cf.go_offline()


# Softmax function
def softmax(x):
    e_x = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e_x / e_x.sum(axis=0)

# Create grid of input values for two variables
x1_vals = np.linspace(-5, 5, 100)
x2_vals = np.linspace(-5, 5, 100)
x1, x2 = np.meshgrid(x1_vals, x2_vals)

# Calculate softmax for the two variables and a third fixed input
z = np.zeros_like(x1)
for i in range(x1.shape[0]):
    for j in range(x1.shape[1]):
        inputs = [x1[i, j], x2[i, j], 1.0]  # Third input is fixed
        softmax_vals = softmax(inputs)
        z[i, j] = softmax_vals[0]  # Plot softmax value for the first class

# Create a 3D surface plot using Plotly
fig = go.Figure(data=[go.Surface(z=z, x=x1, y=x2)])

# Add labels and title
fig.update_layout(title="Softmax Function for 3 Inputs",
                  scene = dict(
                      xaxis_title='x1',
                      yaxis_title='x2',
                      zaxis_title='Softmax(x1)'
                  ))

# Show the plot
fig.show()

5. Different Uses of the Perceptron

The perceptron can be extended to handle various classification cases, including linear classification and multiclass classification, by modifying the loss and activation functions. Here’s how perceptron variations can be applied in different scenarios:

4.1 Linear Binary Classification (Traditional Perceptron)

  • Activation Function: The traditional perceptron uses a step function as its activation function, outputting either 0 or 1 (or equivalently -1 and 1).

    f(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}

  • Loss Function: The perceptron loss function is typically used here. When the classification is wrong, the weights are updated. This only works for linearly separable data.

    \large L(w, b) = \frac{1}{n} \sum_{i=1}^{n} \max(0, - y_i f(\mathbf{x_i}))

    We can also use binary cross-entropy, also called log loss, for this:

    \large L = -\frac{1}{n} \sum_{i=1}^n [y_i \log(\hat{y}_i) + (1-y_i) \log(1 - \hat{y}_i)]

    where \hat{y}_i is the predicted probability.

  • Limitations:

    1. Cannot handle non-linear data.
    2. Only works with binary classification.
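As a sketch, the binary cross-entropy formula above can be computed directly. The helper name and the labels and predicted probabilities below are illustrative; the `eps` clip is a standard guard against log(0):

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean log loss over n samples; eps guards against log(0)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 0])
y_prob = np.array([0.9, 0.1, 0.8, 0.3])
print(binary_cross_entropy(y_true, y_prob))  # ≈ 0.198
```

Confident, correct predictions drive the loss toward zero; confident wrong ones are penalized heavily.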

4.2 Linear Multiclass Classification

To extend the perceptron for multiclass classification, we need a more sophisticated approach. One such approach is the one-vs-all (one-vs-rest) strategy, or we can use softmax regression for a single model.

a) One-vs-All Perceptron
  • Activation Function: Same as in binary classification.
  • Loss Function: Each class gets its own perceptron (classifier), and for each input, the classifier that outputs the highest score is chosen as the predicted class. The perceptron loss can be applied individually to each classifier.
b) Softmax Perceptron (Multinomial Logistic Regression)
  • Activation Function: Softmax function, which converts raw scores (logits) into probabilities for each class.

    \displaystyle\Rightarrow \text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}

  • Loss Function: Cross-entropy loss, which penalizes the model for being incorrect based on how far the predicted probability is from the true label.

    \displaystyle\Rightarrow L = \sum_{i=1}^n - y_i \log(\hat{y}_i)

    where $ y_i $ is the true label and \hat{y}_i is the predicted probability.

  • Benefit: Multiclass classification can handle multiple classes simultaneously rather than separate classifiers for each class.
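A minimal sketch of the softmax and cross-entropy computations above (the logits are illustrative, and the one-hot label marks the first class as true):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # shift by the max for numerical stability
    return e / e.sum()

def cross_entropy(y_onehot, probs):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -np.sum(y_onehot * np.log(probs))

logits = np.array([2.0, 1.0, 0.1])   # illustrative raw scores
probs = softmax(logits)
print(probs.sum())                   # probabilities sum to 1
print(cross_entropy(np.array([1, 0, 0]), probs))
```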

4.3 Non-Linear Classification (Neural Network Extension)

When the data is not linearly separable, the simple perceptron fails. To handle non-linear boundaries, the perceptron can be extended into a multi-layer perceptron (MLP), which is a form of feedforward neural network.

  • Activation Functions:

    • Sigmoid: A smooth, non-linear function that maps inputs to a range of (0,1).

      \displaystyle f(z) = \frac{1}{1 + e^{-z}}

    • ReLU (Rectified Linear Unit): A popular activation function for hidden layers.

      \displaystyle f(z) = \max(0, z)

    • Tanh: Similar to sigmoid but maps values to a range of (-1, 1).

      \displaystyle f(z) = \tanh(z) = \frac{2}{1 + e^{-2z}} - 1

    • Leaky ReLU: Modified ReLU

      \displaystyle f(x) = max(0.01 x, x)

  • Loss Function: For binary classification, you can use log loss. For multiclass classification, you can use cross-entropy loss as described above.

  • Example: A neural network with ReLU in the hidden layers and softmax in the output layer can handle non-linear, multiclass classification.

4.4 Linear Regression (Using Perceptron for Regression Tasks)

Although the perceptron is a classification algorithm, by changing the activation and loss functions, it can handle regression tasks.

  • Activation Function: The identity function f(z) = z, where the output is continuous, can be used.

  • Loss Function: The most commonly used loss function in regression is mean squared error (MSE).

    \displaystyle L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

This transforms the perceptron into a linear regression model.
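A short sketch of this regression variant, with the identity activation and MSE (the weights and data below are illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error over n samples."""
    return np.mean((y_true - y_pred) ** 2)

# identity activation: the "perceptron" output is just z = w·x + b
X = np.array([[1.0], [2.0], [3.0]])
w, b = np.array([2.0]), 1.0          # illustrative parameters
y_pred = X @ w + b                   # [3, 5, 7]
y_true = np.array([3.0, 5.0, 8.0])
print(mse(y_true, y_pred))           # (0 + 0 + 1) / 3 ≈ 0.333
```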

Summary of Loss and Activation Functions for Different Cases

Task                       | Activation Function        | Loss Function
Binary Classification      | Step function, Sigmoid     | Binary cross-entropy, Perceptron loss
Multiclass Classification  | Softmax                    | Cross-entropy loss
Non-Linear Classification  | Sigmoid, ReLU, Tanh        | Cross-entropy loss, Perceptron loss
Linear Regression          | Identity function (linear) | Mean squared error (MSE)

5. Multilayer Perceptron (MLP)

MLP is a type of feed-forward neural network that allows solving more complex problems than a basic single-layer perceptron.

5.1 Feed-forward Neural Network (FFNN)

  • A Feed-forward Neural Network is a type of artificial neural network where the connections between the nodes (neurons) do not form cycles.

  • Information flows in one direction: from the input layer, through the hidden layers (if any), to the output layer.

  • No outputs from any layer are fed back into previous layers.

    The term “feed-forward” means that the data moves forward through the layers, and there are no loops or cycles in the network’s structure.

5.2 MLP Architecture

  • Input Layer: Takes in the raw input features.

  • Hidden Layers:

    1. One or more layers between the input and output layers.
    2. Each neuron in one layer is fully connected to all neurons in the next layer (a fully connected network).
    3. Non-linear activation functions like ReLU, tanh, or sigmoid are applied to each neuron’s output to introduce non-linearity, allowing the model to learn more complex functions.
  • Output Layer: Produces the final prediction.

    An MLP is a type of Feed-forward Neural Network because information flows in one direction, from input to output, without cycles or loops.
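A minimal sketch of a forward pass through such a fully connected network. The layer sizes 3 → 3 → 2 → 1 match the parameter-count example in section 5.4; the random weights and the input vector are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# a 3 -> 3 -> 2 -> 1 network: one weight matrix and one bias vector per layer
sizes = [3, 3, 2, 1]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]

def forward(x):
    a = x
    for W, b_vec in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b_vec)                    # hidden layers: non-linear activation
    return sigmoid(weights[-1] @ a + biases[-1])   # output layer: sigmoid for a (0, 1) output

print(forward(np.array([0.5, -1.0, 2.0])))
```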

[Figure: Multi-layer perceptron (MLP) schematic model]

5.3 MLP Notation

When dealing with neural networks, the back-propagation algorithm, which trains the network by recursively updating its weights and biases to achieve maximum accuracy, is the most challenging part to grasp.

Since a neural network contains many weights and biases, and back-propagation is the process of updating all of them, a systematic notation for the various weights and biases is needed to follow the fundamentals of the algorithm.

5.3.1 Notation for Weights :

Notation: \large \mathbf{W^{h}_{ij}} , where:

\mathbf{i} = the node in the previous layer that the weight leaves.
\mathbf{j} = the node that the weight arrives at.
\mathbf{h} = the layer in which the weight arrives.

Example

  • \large \mathbf{W^{1}_{23}} = Weight is passing from 2nd node of previous layer → 3rd node of 1st layer.
  • \large \mathbf{W^{2}_{43}} = Weight is passing from 4th node of previous layer → 3rd node of 2nd layer.
  • \large \mathbf{W^{3}_{13}} = Weight is passing from 1st node of previous layer → 3rd node of 3rd layer.
5.3.2 Notation for bias :

Notation: \large \mathbf{b_{ij}} , where:

\mathbf{i} = layer to which the bias belongs.
\mathbf{j} = node to which the bias belongs.

Example

  • \large \mathbf{b_{12}} = bias of 2nd node of 1st layer.
  • \large \mathbf{b_{31}} = bias of 1st node of 3rd layer.
  • \large \mathbf{b_{23}} = bias of 3rd node of 2nd layer.
5.3.3 Notation for Output :

Notation: \large \mathbf{O_{ij}} , where:

\mathbf{i} = layer to which the output belongs.
\mathbf{j} = node to which the output belongs.

Example

  • \large \mathbf{O_{12}} = output of 2nd node of 1st layer.
  • \large \mathbf{O_{31}} = output of 1st node of 3rd layer.
  • \large \mathbf{O_{23}} = output of 3rd node of 2nd layer.


5.4 Total Trainable parameters

The total number of trainable parameters between two layers of an artificial neural network is the sum of all the weights and all the biases that exist between them.

\text{Total Trainable Parameters Between two layers in a Neural Network =}
\text{[ Number of Nodes in the first layer * Number of nodes in the second layer ] + [ Number of nodes in the second layer ]}

Example

  • Trainable Parameters Between the Input layer and Hidden Layer 1:
    \text{weights} = 3 * 3 = 9
    \text{biases} = 3 \text{ (3 nodes in hidden layer 1)}
    \text{Trainable parameters = weights + biases = } 9 + 3 = 12

  • Trainable Parameters Between the Hidden Layer 1 and Hidden Layer 2:
    \text{weights} = 3 * 2 = 6
    \text{biases} = 2 \text{ (2 nodes in hidden layer 2)}
    \text{Trainable parameters = weights + biases = } 6 + 2 = 8

  • Trainable Parameters Between Hidden Layer 2 and the Output Layer:
    \text{weights} = 2 * 1 = 2
    \text{biases} = 1 \text{ (1 node in output layer)}
    \text{Trainable parameters = weights + biases = } 2 + 1 = 3

\text{Total Trainable parameters = } 12 + 8 + 3 = 23
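The counting rule above can be checked with a small helper (the function name is illustrative); the layer sizes follow the 3 → 3 → 2 → 1 example:

```python
def trainable_params(sizes):
    """Sum of weights (n_in * n_out) and biases (n_out) between consecutive layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

# the 3 -> 3 -> 2 -> 1 network from the example
print(trainable_params([3, 3, 2, 1]))  # (9+3) + (6+2) + (2+1) = 23
```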